Tolerating Arbitrary Failures With State Machine Replication
Abstract
Our growing daily reliance on services provided by distributed applications (e.g., air-traffic control, public switched telephone networks, and electronic commerce) leaves us vulnerable to the failures of these services. The challenge of fault tolerance is to provide services that survive failures. The design and verification of fault-tolerant distributed applications is, however, viewed as a difficult task. Fortunately, several paradigms that simplify this task have been identified in recent years. Key among them is state machine replication [12, 15, 19]. The underlying idea is intuitively simple: every crucial service that needs to be made fault tolerant is replicated on several computers that are assumed to fail independently. The presence of several replicas ensures the high availability of the service. To preserve the consistency of the service, invocations of its replicas, even those coming from different clients, are handled in such a way that they reach all replicas in the same order. The abstraction that provides this guarantee is called the total order broadcast primitive. Roughly speaking, this communication primitive ensures that messages broadcast within a group of processes are delivered in the same order, despite concurrency and failures.
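The ordering idea described above can be illustrated with a minimal sketch. It is not the protocol from the paper: it assumes a hypothetical fixed `Sequencer` that never fails (so it tolerates no faults at all, let alone arbitrary ones), and the names `Replica`, `Sequencer`, and `run_demo` are invented for illustration. It only shows why delivering the same commands in the same order keeps deterministic replicas consistent, even when messages arrive in different network orders.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """A deterministic state machine: here, a small key-value store."""
    state: dict = field(default_factory=dict)
    next_seq: int = 0                             # next sequence number to apply
    pending: dict = field(default_factory=dict)   # messages that arrived early

    def deliver(self, seq: int, command: tuple) -> None:
        """Buffer the message, then apply every command that is next in order."""
        self.pending[seq] = command
        while self.next_seq in self.pending:
            op, key, value = self.pending.pop(self.next_seq)
            if op == "set":
                self.state[key] = value
            elif op == "add":
                self.state[key] = self.state.get(key, 0) + value
            self.next_seq += 1


class Sequencer:
    """Hypothetical, failure-free sequencer that fixes one global order."""

    def __init__(self) -> None:
        self.counter = 0

    def order(self, command: tuple) -> tuple:
        seq = self.counter
        self.counter += 1
        return seq, command


def run_demo() -> list:
    sequencer = Sequencer()
    replicas = [Replica() for _ in range(3)]
    commands = [("set", "x", 1), ("add", "x", 5), ("set", "y", 2), ("add", "x", -3)]
    ordered = [sequencer.order(c) for c in commands]

    # Each replica sees a different arrival order from the (simulated) network.
    arrival_orders = [ordered, list(reversed(ordered)), ordered[1:] + ordered[:1]]
    for replica, arrivals in zip(replicas, arrival_orders):
        for seq, command in arrivals:
            replica.deliver(seq, command)
    return [replica.state for replica in replicas]


if __name__ == "__main__":
    print(run_demo())
```

All three replicas end in the same state because each applies commands strictly by sequence number. A real total order broadcast has to reach agreement on that order without trusting any single process.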
Similar Resources
Contributions to Building Efficient and Robust State-Machine Replication Protocols
State machine replication (SMR) is a software technique for tolerating failures using commodity hardware. The critical service to be made fault-tolerant is modeled by a state machine. Several, possibly different, copies of the state machine are then deployed on different nodes. Clients of the service access the replicas through an SMR protocol which ensures that, despite concurrency and failures...
Practical Hardening of Crash-Tolerant Systems
Recent failures of production systems have highlighted the importance of tolerating faults beyond crashes. The industry has so far addressed this problem by hardening crash-tolerant systems with ad hoc error detection checks, potentially overlooking critical fault scenarios. We propose a generic and principled hardening technique for Arbitrary State Corruption (ASC) faults, which specifically m...
Paxos Replicated State Machines as the Basis of a High-Performance Data Store
Conventional wisdom holds that Paxos is too expensive to use for high-volume, high-throughput, data-intensive applications. Consequently, fault-tolerant storage systems typically rely on special hardware, semantics weaker than sequential consistency, a limited update interface (such as append-only), primary-backup replication schemes that serialize all reads through the primary, clock synchroni...
From Viewstamped Replication to Byzantine Fault Tolerance
The paper provides a historical perspective on two replication protocols, each of which was intended for practical deployment. The first is Viewstamped Replication, which was developed in the 1980s and allows a group of replicas to continue to provide service in spite of a certain number of crashes among them. The second is an extension of Viewstamped Replication that allows the group to s...
Responsive Security for Stored Data
We present the design of a distributed store that offers various levels of security guarantees while tolerating a limited number of nodes that are compromised by an adversary. The store uses secret sharing schemes to offer security guarantees, namely availability, confidentiality, and integrity. However, a pure secret sharing scheme could suffer from performance problems and high access costs. We...